Abstract

When fitting a linear regression model, deleting a group of observations
from the data may decrease the precision of parameter estimates
substantially or, worse, render some parameters inestimable. These
effects occur when the deletion of the group degrades the rank of the
model matrix for the regression. We present theory and methods for
identifying such rank-influential groups of observations.

1  Introduction

Let $(y, X)$ denote a data set where $X$ is an $n \times p$ matrix of values
of $p$ explanatory variables for each of $n$ cases and $y$ is an $n$-vector
of values for a single response variable for each case. When fitting
the linear regression model $y = X\beta + \epsilon$, where $\beta$ is the
$p$-vector of regression parameters and $\epsilon$ is an $n$-vector of
regression errors, the assumption that $X$ has full rank is necessary for
the estimability of every component of $\beta$. Even if $X$ has full rank,
it may contain a group of cases, $I$, whose exclusion from the data would
render $X$ rank deficient. The presence of such a group in the data
determines whether the linear model parameters are all estimable. We
shall call such groups "rank influential."

Identifying rank-influential groups is important both before collecting
the response variable data and afterward, during the analysis.
Suppose a full-rank model matrix $X$ is proposed, but a case group $I$
exists for which the reduced matrix $X_{(I)}$ is rank deficient. If there is
a nonnegligible probability that the response variable will be missing
for all cases in $I$, then one might prefer to add more cases to the
design to protect against the possibility that model parameters are
inestimable. When analyzing the fit of a linear model, it could be
important to know that a group of cases is strongly influential in that
its absence from the data would render certain parameter estimates
unavailable. Additional data collection and analysis might be
appropriate because inferences based on these estimates depend critically
on data from cases in the rank-influential group.

The importance of rank-influential case groups has been acknowledged
in the literature; see, for example, Cook and Weisberg (1980,
1982) and Belsley, Kuh, and Welsch (1980). However, no practical
method has been proposed for identifying such case groups except in
the special case of groups of size one.

In Section 2, we address the issue of estimability and rank of the
model matrix when fitting a linear regression model. We present
mathematical characterizations of rank-influential case groups in
Section 3. Section 4 proposes two methods for identifying such groups.
These methods rely on factor-analytic rotation algorithms and
projection pursuit techniques and are computationally feasible alternatives
to direct combinatorial search algorithms. In addition to identifying
rank-influential groups, the methods also identify case groups which
are potentially very influential on estimates of model parameters.
Section 4 also details how the methods perform on both simulated and
real data. Section 5 concludes with some discussion of related
diagnostic methods.

2  Estimability and Rank Deficiency in Linear Models

Suppose the dependence of $y$ on $X$ follows a linear model:

$$y = X\beta + \epsilon \tag{1}$$

where $\beta = [\beta_1, \beta_2, \ldots, \beta_p]^T$ is the vector of fixed
unknown regression parameters and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$
is the vector of random errors with $E(\epsilon) = 0$ and
$\mathrm{Cov}(\epsilon) = E(\epsilon\epsilon^T) = \sigma^2 I_n$, where $I_n$ is the
order-$n$ identity matrix and $\sigma^2$ is the unknown, positive variance
of the errors.

It is well known that the regression parameter $\beta$ is estimable only
if $X$ has full (column) rank. If $X$ does not have full rank, then for any
$\beta \in \mathbb{R}^p$ specifying the linear model (1), there exist
(infinitely many) $\beta' \ne \beta$ such that $X\beta = X\beta'$.

One method of determining whether a matrix $X$ has full rank is
to calculate $|X^T X|$, the determinant of the $p \times p$ matrix $X^T X$. The
determinant is non-zero if and only if $X$ has full (column) rank $p < n$.
To interpret magnitudes of the determinant, note that when the
distribution of $\epsilon$ is multivariate normal, the volume of a confidence
ellipsoid for $\beta$ is proportional to $|X^T X|^{-1/2}$. The smaller the
determinant $|X^T X|$, the larger the confidence ellipsoid of a given level
for $\beta$, and in the limit when $X$ has deficient rank, the volume of any
confidence ellipsoid is infinite.

Another method for assessing the rank of a matrix $X$ is based on
the singular value decomposition (SVD) of $X$. Golub (1969), Belsley,
Kuh, and Welsch (1980), and Stewart (1984) discuss the SVD. The
rank of a matrix $X$ is the number of non-zero singular values in the
SVD of $X$. So $X$ has full rank if and only if its smallest singular value is
non-zero. In general, the singular values of $X$ must be determined by
numerical methods. If the ratio of the largest to the smallest singular
value of $X$, that is, its condition number $\kappa(X)$, is large enough, then
the smallest singular value is indistinguishable from zero relative to
the largest singular value. So, the larger the condition number, the
closer a matrix is to having deficient rank.
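
The two rank checks described above can be tried side by side. The following is a minimal sketch, assuming NumPy; the function name `rank_diagnostics` and the small design `X_demo` are hypothetical and serve only as illustration.

```python
import numpy as np

def rank_diagnostics(X):
    """Return det(X^T X), the singular values of X, and the condition number."""
    X = np.asarray(X, dtype=float)
    gram_det = np.linalg.det(X.T @ X)
    sing_vals = np.linalg.svd(X, compute_uv=False)  # sorted in decreasing order
    smallest = sing_vals[-1]
    cond = np.inf if smallest == 0.0 else sing_vals[0] / smallest
    return gram_det, sing_vals, cond

# Hypothetical 4 x 2 design: the second column is almost twice the first,
# so X is numerically close to rank deficient.
X_demo = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0 + 1e-6]])
det_demo, sv_demo, cond_demo = rank_diagnostics(X_demo)
print(det_demo, sv_demo, cond_demo)
```

In this sketch a tiny determinant and a very large condition number both signal the same near-deficiency; the condition number has the advantage of being scale-free.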

The SVD provides a characterization of the contrasts of the regression
parameter $\beta$ which are estimable/inestimable. Let $X = UDV^T$
be a SVD, where $U$ is $n \times p$ column orthonormal, $V$ is $p \times p$ column
orthonormal, and $D$ is a $p \times p$ diagonal matrix of non-negative singular
values. If $V_{[j]}$ is the $j$th column of $V$ which corresponds to a non-zero
singular value of $X$, then the contrast $\phi = V_{[j]}^T \beta$ is estimable. The
linear subspace spanned by all estimable contrasts can be defined by

$$\Phi = \{\phi_l = l^T \beta,\; l \in \mathrm{row}(X)\},$$

where $\mathrm{row}(X)$ is the row-space of $X$, which coincides with
$\mathrm{span}\{V_{[j]} : D_{jj} \ne 0,\; j = 1, \ldots, p\}$. Any contrast $\phi_l$ for
which $l \notin \mathrm{row}(X)$ is inestimable.

We have just characterized the inestimable contrasts by saying
that they are any contrast which is not estimable. A more
informative characterization is to note that the subspace orthogonal to
$\mathrm{row}(X)$, i.e., $\mathrm{row}(X)^{\perp}$, consists of $p$-vectors $l$ defining
contrasts $\phi_l$ for which the sample regression provides no information.
These are the fundamental quantities which cannot be estimated with the
information available in the sample. An operational characterization of
these special contrasts is

$$\Phi_0 = \{\phi_l = l^T \beta,\; l \in \mathrm{span}\{V_{[j]} : D_{jj} = 0,\; j = 1, \ldots, p\}\}.$$

Of course, it may be that while the matrix $X$ is rank deficient, the
regression (1) is still useful for its estimates of contrasts $\phi_l \in \Phi$.
However, we are particularly concerned about those situations when the
sole purpose of the analysis is to infer about contrasts which are
inestimable or would be inestimable without the presence of a particular
case group.
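
As an illustration of the SVD characterization of estimable contrasts, the sketch below tests whether a given coefficient vector lies in the row space of the model matrix. It assumes NumPy; the helper `is_estimable`, its tolerance, and the small design `X_demo` are hypothetical choices, not part of the original development.

```python
import numpy as np

def is_estimable(X, l, tol=1e-10):
    """Check whether the contrast l^T beta is estimable, i.e. l lies in row(X)."""
    X = np.asarray(X, dtype=float)
    l = np.asarray(l, dtype=float)
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    # Columns of V that correspond to non-zero singular values span row(X).
    V_pos = Vt[d > tol * d.max()].T
    # Component of l that falls outside row(X).
    residual = l - V_pos @ (V_pos.T @ l)
    return bool(np.linalg.norm(residual) <= tol * max(1.0, np.linalg.norm(l)))

# Hypothetical rank-deficient design: an intercept plus two treatment
# indicators that sum to the intercept column.
X_demo = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)
print(is_estimable(X_demo, [0, 1, -1]))  # treatment difference
print(is_estimable(X_demo, [1, 0, 0]))   # intercept on its own
```

The treatment difference is a combination of rows of `X_demo`, while the intercept alone is not, which is the distinction the sets of estimable and inestimable contrasts formalize.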

3  Rank-Influential Case Groups: Characterization

For any group $I = \{i_1, i_2, \ldots, i_k\} \subset \{1, 2, \ldots, n\}$ of $k$ cases
(indices) let $(y_{(I)}, X_{(I)})$ denote the reduced data set which excludes
those cases in $I$. A group $I$ is rank influential if $X_{(I)}$ is rank
deficient, whereas the complete model matrix $X$ has full rank. We propose
identifying rank-influential groups of cases by finding those case groups
$I = \{i_1, i_2, \ldots, i_k\}$ for which $|X_{(I)}^T X_{(I)}|$ is zero or close to
zero relative to $|X^T X|$. (Of course, if $k > n - p$ then the reduced model
matrix $X_{(I)}$ is necessarily rank deficient.)

For any group $I$, it is well known that

$$|X_{(I)}^T X_{(I)}| = |X^T X| \times |I_k - H_I| \tag{2}$$

where $I_k$ is the order-$k$ identity matrix, and $H_I$ is the $k \times k$ submatrix
of $H = X(X^T X)^{-1} X^T$ consisting of those elements $H_{ij}$ where both
$i \in I$ and $j \in I$. See, for example, Cook and Weisberg (1982, p. 210).
If $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ are the $k$ ordered eigenvalues of $H_I$, then

$$|I_k - H_I| = \prod_{i=1}^{k} (1 - \lambda_i). \tag{3}$$

It follows that $|X_{(I)}^T X_{(I)}| / |X^T X|$ is zero (or close to zero) if and only
if $\lambda_1$ is one (or close to one). We shall use $\lambda_I$ to denote $\lambda_1$, the
largest eigenvalue of $H_I$.
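
Identities (2) and (3) are easy to confirm numerically. The sketch below, assuming NumPy, compares the determinant ratio computed directly from the reduced matrix with the product over eigenvalues of the submatrix of the hat matrix. The function name and the random design are hypothetical, and case indices are 0-based in the code.

```python
import numpy as np

def deletion_determinant_ratio(X, group):
    """Return (direct ratio, ratio via eigenvalues of H_I, largest eigenvalue).

    `group` holds 0-based row indices of the cases to delete.
    """
    X = np.asarray(X, dtype=float)
    idx = np.asarray(sorted(group))
    X_red = np.delete(X, idx, axis=0)
    direct = np.linalg.det(X_red.T @ X_red) / np.linalg.det(X.T @ X)
    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (X^T X)^{-1} X^T
    H_I = H[np.ix_(idx, idx)]                  # k x k submatrix for the group
    eigvals = np.linalg.eigvalsh(H_I)          # ascending order
    via_eigs = np.prod(1.0 - eigvals)
    return direct, via_eigs, eigvals[-1]

rng = np.random.default_rng(0)                 # illustrative random design
X_demo = rng.normal(size=(12, 3))
direct, via_eigs, lam = deletion_determinant_ratio(X_demo, [0, 4, 7])
print(direct, via_eigs, lam)
```

The two ratios should agree up to rounding error, and the last returned value is the quantity denoted by the largest eigenvalue of the group's submatrix.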

When $k = 1$ or $k = 2$, these eigenvalue computations can be solved
easily, albeit separately for each case group. For $k = 1$ and $I = \{i\}$,
say, $\lambda_I = H_{ii}$, the $i$th diagonal element of $H$. So, rank influence
of a case is fully explained by its leverage, as defined by Hoaglin and
Welsch (1978). For $k = 2$ and $I = \{i, j\}$,

$$\lambda_I = \left[ H_{ii} + H_{jj} + \sqrt{(H_{ii} - H_{jj})^2 + 4 H_{ij}^2} \right] / 2.$$

To identify rank-influential pairs directly then, we must consider
$\binom{n}{2}$ pairs $I$ and their associated $\lambda_I$. To generalize to groups of
size $k > 2$, we must solve $\binom{n}{k}$ $k$-order polynomials for their largest
roots. The combinatorial and numerical complexities of this approach are
clearly non-negligible. (However, some computational savings could be had
by exploiting relationships like $I \subset I' \Rightarrow \lambda_I \le \lambda_{I'}$.)
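
To make the cost of the direct approach concrete, the following sketch enumerates every group of a given size and ranks the groups by their largest eigenvalue, using the closed form for pairs as a cross-check. It assumes NumPy; the function names and the six-case design are hypothetical, and indices are 0-based.

```python
import numpy as np
from itertools import combinations

def lambda_pair(H, i, j):
    """Closed-form largest eigenvalue of the 2 x 2 submatrix of H for cases i, j."""
    return 0.5 * (H[i, i] + H[j, j]
                  + np.sqrt((H[i, i] - H[j, j]) ** 2 + 4.0 * H[i, j] ** 2))

def lambda_group(H, group):
    """Largest eigenvalue of the submatrix of H indexed by `group`."""
    idx = np.asarray(group)
    return np.linalg.eigvalsh(H[np.ix_(idx, idx)])[-1]

def brute_force_search(X, k):
    """Score all C(n, k) groups; return them sorted by decreasing eigenvalue."""
    X = np.asarray(X, dtype=float)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    scored = [(lambda_group(H, g), g) for g in combinations(range(X.shape[0]), k)]
    return sorted(scored, key=lambda t: -t[0]), H

# Hypothetical design: intercept plus a covariate that varies only in the
# last two cases, so deleting those two cases leaves a constant covariate.
x = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 2.0])
X_demo = np.column_stack([np.ones(6), x])
ranked, H_demo = brute_force_search(X_demo, 2)
print(ranked[:3])
```

The number of groups scored grows as the binomial coefficient, which is what motivates the search methods that avoid enumeration.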

Also, methods based on the singular values or condition number of
the reduced matrices $X_{(I)}$ suffer from similar computational burdens.
The effects on the singular values of a matrix when $k$ rows are deleted
depend on the SVD of $H_I$ as well. Kempthorne (1985) discusses this
problem and gives an exact characterization when $k = 1$ and $p = 2$. In
general, the solution requires determining the smallest (and largest)
root of a $p$th-order polynomial for every case group whose rank
influence is in question. When $p > 2$ and $k$ is small, Hadi (1988) and
Hadi and Wells (1989) propose excellent approximations to changes
in the extreme singular values and the condition number when a case
group is deleted from the model matrix. However, even with such
approximations the combinatorial search over possible subsets of cases
remains.

We  propose  methods  in  the  next  section  with  the  express  intent  of 
avoiding  a  search  with  modest  to  high  numerical  and  combinatorial 
complexities.  They  rely  on 


Theorem 1. When $X$ is an $n \times p$ model matrix of full rank, a group of
cases $I$ is rank influential, that is, $\lambda_I = 1$, if and only if there exists
a non-null vector $u \in C$, the column space of $X$, satisfying $u_i = 0$ for
all $i \notin I$.

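Proof.

The case group $I$ is rank influential
$\iff$ $X_{(I)}$ is rank deficient
$\iff$ $\exists\, l \in \mathbb{R}^p$, $l \ne 0$, such that $X_{(I)} l = 0$
$\iff$ $\exists\, l \in \mathbb{R}^p$, $l \ne 0$, such that $[Xl]_i = 0$, $\forall\, i \notin I$
$\iff$ $\exists\, u \in C$, $u \ne 0$, such that $u_i = 0$ $\forall\, i \notin I$.

The last step uses $u = Xl$, which is non-null whenever $l \ne 0$ because $X$
has full rank.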

4      Identifying  Rank-Influential  Groups 

By  Theorem  1  we  can  detect  rank-influential  groups  by  searching  C, 
the  column  space  of  X,  for  any  vectors  that  are  closely  aligned  to  a 
few  coordinate  axes  and  perpendicular  to  the  rest. 
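
The equivalence in Theorem 1 is constructive: a null vector of the reduced matrix yields a vector in the column space that vanishes outside the group. A minimal sketch of that construction, assuming NumPy, follows; `witness_vector` and the small design are hypothetical names, the group is assumed to be a proper subset of the cases, and indices are 0-based.

```python
import numpy as np

def witness_vector(X, group, tol=1e-10):
    """Return a unit vector u in col(X) with u_i = 0 for all i outside `group`,
    or None when the group is not rank influential.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    keep = np.setdiff1d(np.arange(n), np.asarray(group))
    _, d, Vt = np.linalg.svd(X[keep], full_matrices=True)
    rank = int(np.sum(d > tol * d.max())) if d.max() > 0 else 0
    if rank == p:
        return None                  # reduced matrix still has full rank
    l = Vt[-1]                       # a null vector of the reduced matrix
    u = X @ l                        # lies in col(X) and vanishes on `keep`
    return u / np.linalg.norm(u)

# Hypothetical design: intercept plus a covariate that varies only in the
# last two cases.
x = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 2.0])
X_demo = np.column_stack([np.ones(6), x])
u_demo = witness_vector(X_demo, [4, 5])
print(u_demo)
print(witness_vector(X_demo, [0, 1]))
```

This still requires a candidate group to be supplied; the methods of this section instead search the column space for such vectors without enumerating groups.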

The first method exploits factor-analytic rotation algorithms to
identify such $n$-vectors $u$ whose nonzero components identify
rank-influential case groups.

4.1      Search  by  Quartimax/Varimax  Rotation 

Any unit vector $u' \in C$ can be expressed as $u' = Ua$ for some unit
vector $a \in \mathbb{R}^p$, where $U$ is any $n \times p$ matrix whose columns form
an orthonormal basis for the column space of $X$ (which we assume
has full rank). The matrix $U$ can be calculated, for example, as the
"Q" matrix in the QR-decomposition of $X$ or as the left
column-orthonormal matrix in a SVD of $X$.

Consider the matrix $U' = UA$, where $A$ is any $p \times p$ orthogonal
matrix. The columns of $U'$ also form an orthonormal basis for $C$. If
the first column of $A$ is $a$, then the first column of $U'$ is $u'$, as before.
For any group $I$, a lower bound for $\lambda_I$ can easily be calculated using
$U'$, as we now show.


Theorem 2. Suppose $U'$ is an $n \times p$ matrix whose columns form an
orthonormal basis for $C$, the column space of $X$. Then

$$\lambda_I \ge \max_{1 \le j \le p} \sum_{i \in I} [U'_{ij}]^2.$$


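Proof. Suppose that the size of group $I$ is $k$. If the maximum
eigenvalue of $H_I$ is $\lambda_I$, then, for all non-null $k$-vectors $v$,
$v^T H_I v / v^T v \le \lambda_I$. Let $v$ be the $k$-vector with components $U'_{ij}$,
$i \in I$, for any fixed $j$, so that $v^T v = \sum_{i \in I} [U'_{ij}]^2$. It follows
that $v^T H_I v \ge \left[ \sum_{i \in I} [U'_{ij}]^2 \right]^2$, because $H_I = U'_I {U'_I}^T$,
where $U'_I$ consists of the rows $i \in I$ of $U'$ and $v$ is its $j$th column.
Since then $v^T v \le v^T H_I v / v^T v \le \lambda_I$ and $j$ is arbitrary, the
conclusion of the theorem follows.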

One way to find rank-influential groups then is to transform $U$
orthogonally to $U'$ in such a way that the columns of $U'$ have as many
as possible zero or near-zero elements and a few elements with large
magnitudes. If there is a case group $I$ and a column $j$ for which
$U'_{ij} = 0$ for $i \notin I$, or equivalently $\sum_{i \in I} [U'_{ij}]^2 = 1$, then $I$ is a
rank-influential group by Theorems 1 and 2.

Consider transforming $U$ to $U' = UA$ by applying the varimax/quartimax
rotation $A$. Kaiser (1958) proposed the varimax method for transforming
a matrix $U$ to $U'$, orthogonally, so as to maximize

$$\psi = \sum_{j=1}^{p} \psi_j,$$

where

$$\psi_j = \frac{1}{n} \sum_{i=1}^{n} \left\{ [U'_{ij}]^2 - \frac{1}{n} \sum_{i'=1}^{n} [U'_{i'j}]^2 \right\}^2, \qquad j = 1, 2, \ldots, p.$$

The term $\psi_j$ is the variance of the squared elements in the $j$th column
of $U'$. In our case where $U$ is column orthonormal, the sum of the
squared elements in each column of $U$ (and hence of $U'$) is one, so the
varimax criterion is equivalent to choosing the orthogonal transformation
to maximize the quartimax criterion

$$\psi^* = \sum_{j=1}^{p} \sum_{i=1}^{n} [U'_{ij}]^4.$$
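
The equivalence of the two criteria can be spelled out in one line. The following worked step is a sketch that uses only the unit column sums of squares noted above, writing $\psi_j$ for the variance of the squared elements in column $j$ of $U'$ and $\psi^* = \sum_{j} \sum_{i} [U'_{ij}]^4$ for the quartimax criterion:

$$\psi_j = \frac{1}{n} \sum_{i=1}^{n} \left\{ [U'_{ij}]^2 - \frac{1}{n} \right\}^2 = \frac{1}{n} \sum_{i=1}^{n} [U'_{ij}]^4 - \frac{2}{n^2} + \frac{1}{n^2} = \frac{1}{n} \sum_{i=1}^{n} [U'_{ij}]^4 - \frac{1}{n^2},$$

so that

$$\sum_{j=1}^{p} \psi_j = \frac{1}{n}\, \psi^* - \frac{p}{n^2}.$$

Since $n$ and $p$ are fixed, any orthogonal rotation that maximizes one criterion maximizes the other.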

These equivalent rotation criteria were proposed in factor analysis
to achieve interpretable factors from column-orthonormal loading
matrices. In particular, it is argued that maximization of $\psi$ or $\psi^*$
results potentially in "a factor matrix with a maximum tendency to
have both small and large loadings" (Kaiser 1958, p. 188).

We  now  propose  a  four-step  method  for  identifying  rank-influential 
groups. 


Method  1:  Quartimax/Varimax  Search  of  C. 

Step  1:   Construct  U,  a  column-orthonormal  matrix  whose  columns 
are  a  basis  for  C,  the  column  space  of  X. 

Step 2: Transform $U$ to $U'$, applying the varimax/quartimax rotation.

Step 3: For each column $j$ of $U'$ examine the squared elements and
identify those groups $I$ for which $[U'_{ij}]^2 \approx 0$, $i \notin I$, and
$[U'_{ij}]^2 > 0$, $i \in I$.

Step 4: For any group $I$ identified in Step 3, calculate

$$l_I = \max_{1 \le j \le p} \sum_{i \in I} [U'_{ij}]^2,$$

the lower bound for $\lambda_I$; if $l_I = 1$, then classify case group $I$ as
rank influential.
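
The four steps can be sketched in a few lines. This is an illustrative sketch assuming NumPy, not the implementation used for the examples below. The rotation is found by a simple polar-factor ascent on the quartimax criterion (which, for column-orthonormal input, is equivalent to varimax as noted earlier), and it may stop at a local maximum. The function names, the tolerance `zero_tol`, and the demonstration design are hypothetical; row indices are 0-based.

```python
import numpy as np

def orthonormal_basis(X):
    """Step 1: column-orthonormal basis U for the column space of full-rank X."""
    U, _, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    return U

def quartimax(U, max_iter=1000, tol=1e-12):
    """Step 2: orthogonal A that (locally) maximises sum((U A)**4)."""
    A = np.eye(U.shape[1])
    best = np.sum(U ** 4)
    for _ in range(max_iter):
        L = U @ A
        # Polar factor of the criterion's gradient gives the next rotation.
        P, _, Qt = np.linalg.svd(U.T @ (L ** 3))
        A_new = P @ Qt
        value = np.sum((U @ A_new) ** 4)
        if value <= best + tol:      # accept only strict improvements
            break
        A, best = A_new, value
    return U @ A, A

def column_groups(U_rot, zero_tol=1e-6):
    """Steps 3-4: per column, rows with non-negligible squares and the bound l_I."""
    sq = U_rot ** 2
    out = []
    for j in range(sq.shape[1]):
        group = tuple(int(i) for i in np.flatnonzero(sq[:, j] > zero_tol))
        l_I = sq[list(group)].sum(axis=0).max()   # max over columns of the sum over I
        out.append((group, l_I))
    return out

# Hypothetical design: intercept plus a covariate that varies only in the
# last two cases.
x = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 2.0])
U = orthonormal_basis(np.column_stack([np.ones(6), x]))
U_rot, A = quartimax(U)
results = column_groups(U_rot)
print(results)
```

A group whose bound equals one (up to rounding) would be classified as rank influential in Step 4; groups with a bound below one but still large are candidates for high potential.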

Although our interest focuses on case groups $I$ for which $l_I = 1$,
it is also useful to identify those groups with $l_I$ less than one but still
large. Draper and John (1981) propose the measure $|I_k - H_I|$ as a
multiple-cases extension of the leverage used by J.W. Tukey, Huber
(1975), and Hoaglin and Welsch (1978). Our measure $l_I$ provides an
upper bound, $(1 - l_I)$, for $|I_k - H_I|$. Draper and John discuss how
their measure can be interpreted as a factor in the statistic proposed
by Andrews and Pregibon (1978) to assess the influence of case groups
in linear regression analysis.

Cook and Weisberg (1982, Section 3.6) propose $\lambda_I$ and $|I_k - H_I|$
as possible measures of the "potential" of group $I$ in linear
regression. When a case group has high potential, a small change in the
observed vector $y$ (in certain directions) can result in large changes in
the value of the least squares estimate $\hat\beta = (X^T X)^{-1} X^T y$. They use
$\lambda_I$ to construct upper bounds on the value of a statistic proposed by
Cook (1977, 1979) to measure the distance between the least squares
estimates of $\beta$ when a case group is included or excluded from the
data. The statistic $l_I$ we propose to assess rank influence could also
be employed as a diagnostic to screen the data for case groups whose
presence in the data may have a large impact on the values of fitted
parameters.

We now present three examples. The first is an application to
real data which succeeds in identifying a single rank-influential case
group. The second uses artificial data to illustrate how the
Quartimax/Varimax Search can identify $p$ distinct case groups. The third
example also uses simulated data and illustrates how the method can
fail.


Example 1. The Florida Area Cumulus Experiment was conducted
in 1975 to determine the effect of seeding clouds with silver iodide to
increase the amount of rainfall. Cook and Weisberg (1982) use data
from this experiment to study the detection of influential observations
in regression. They give (pp. 3-4) the data together with complete
descriptions of each variable. For our purposes we note that $y$ is the
response variable, a measure of the amount of rain, and the others are
explanatory variables, where the variable "A" indicates seeding (1) or
no seeding (0).

The following linear regression model is proposed:

$$\begin{aligned}
\log_{10} y = {} & \beta_0 + \beta_1 A + \beta_2 T + \beta_3 (S - N_e) + \beta_4 C + \beta_5 \log_{10} P + \beta_6 E \\
& + \beta_7 [A \times (S - N_e)] + \beta_8 [A \times C] + \beta_9 [A \times \log_{10} P] + \beta_{10} [A \times E] \\
& + \epsilon
\end{aligned}$$

The cross-product variables are included to model a possible
non-additive treatment effect due to seeding. The $X$ matrix for this model
has 24 rows and 11 columns.

Accepting the legitimacy of fitting such a complex model with
only 24 observations, we apply the Quartimax/Varimax Search to
identify rank-influential case groups. Table 1 gives $U$, the left
column-orthonormal matrix in the SVD of $X$, whose columns are a basis for
$C$, the column space of $X$. Note that $U$ reveals no rank-influential
groups; that is, no column has nonnegligible elements in only a few
rows. The transformed matrix $U'$ was obtained by a varimax rotation
of $U$, and Table 2 gives the squares of the elements in $U'$. A brief
examination of the columns of $U'$ reveals that in the seventh column,
all the elements are very close to zero except those in rows 3 and 20.
For case group $I = \{3, 20\}$, a simple calculation gives $l_I = .999997$.
Since the numerical calculations were performed in single precision,
we can conclude that $\lambda_I = 1$, that is, $I$ is rank influential. Indeed,
when cases 3 and 20 are removed from the data, the variables $A$ and
$A \times E$ are identical and the $X$ matrix is rank deficient.

The rank influence of this pair was noted by Cook and Weisberg
(1982, p. 145), but they calculated $\lambda_I$ for all pairs $I$. They also note
that any pair including case 2 has "high potential." Using column
2 of $U'$, we find that $l_{\{2\}} = .97283$, confirming that case 2 has high
potential. Although $l_I$ will exceed $l_{\{2\}}$ for any pair $I$ including case 2,
the increase is always smaller than 1%. This supports their opinion
that case 2 ought not to be viewed as a member of a pair (other than
with case 20).

Once a rank-influential case group has been identified, it is useful
to understand how the group affects the estimability of $\beta$. The
estimability of certain contrasts of $\beta$ is unaffected. In terms of the
SVD of $U_{(I)}$, the contrasts rendered inestimable if case group $I$ is
unavailable are $\phi_a = a^T \beta$, for $a$ in the space spanned by the right
singular-vectors of U corresponding to unit singular values.


Example 2. To demonstrate that the Quartimax/Varimax Search can
identify multiple rank-influential case groups, consider the model
matrix associated with a trivial analysis of covariance experiment:

$$X = \begin{bmatrix}
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 2 \\
1 & 0 & 2 \\
0 & 1 & 3 \\
0 & 1 & 3
\end{bmatrix}$$



The first two columns identify one of two treatments and the third
column gives the values of a covariate. The regression model consists
of two parallel lines describing the relationship between the response
and the covariate with the difference between the lines giving the
treatment effect. It is easily shown that the least-squares fit to the
slope of the parallel lines is inestimable if either $I = \{1, 2\}$ or
$I' = \{3, 4\}$ is excluded. Also, the treatment effect cannot be estimated if
$I'' = \{5, 6\}$ is excluded.
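
These claims about the example can be checked directly by deleting each pair of cases and recomputing the rank. The sketch below assumes NumPy; the matrix entries are those of the model matrix in this example, while the helper `rank_without` is a hypothetical name. Case numbers are 1-based to match the text.

```python
import numpy as np

# Model matrix of Example 2: two treatment indicators and one covariate.
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [1, 0, 2],
              [1, 0, 2],
              [0, 1, 3],
              [0, 1, 3]], dtype=float)

def rank_without(X, cases):
    """Rank of X after deleting the given 1-based case numbers."""
    rows = [c - 1 for c in cases]
    return int(np.linalg.matrix_rank(np.delete(X, rows, axis=0)))

full_rank = int(np.linalg.matrix_rank(X))
ranks = {group: rank_without(X, group) for group in [(1, 2), (3, 4), (5, 6), (1, 3)]}
print(full_rank, ranks)
```

The pair (1, 3) is included only as a contrast: it removes one case from each covariate level of the first treatment, so the covariate still varies within that treatment.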

The left column-orthonormal matrix from the SVD of $X$ is

$$U = \begin{bmatrix}
.2117 & .4282 & -.5214 \\
.2117 & .4282 & -.5214 \\
.3844 & .3725 & .4620 \\
.3844 & .3725 & .4620 \\
.5544 & -.4218 & -.1213 \\
.5544 & -.4218 & -.1213
\end{bmatrix}$$



The three rank-influential groups $I$, $I'$, $I''$ are not readily apparent
in this basis matrix for $C$. But they are highlighted upon applying
the Quartimax/Varimax rotation $A$ to $U$, giving

$$U' = UA = \begin{bmatrix}
0 & 0 & -.7071 \\
0 & 0 & -.7071 \\
0 & .7071 & 0 \\
0 & .7071 & 0 \\
.7071 & 0 & 0 \\
.7071 & 0 & 0
\end{bmatrix}$$

While the Quartimax/Varimax Search was successful in Examples
1 and 2, it is limited to identifying at most $p$ rank-influential case
groups, and multiple groups must lie in orthogonal subspaces.
Moreover, there is no guarantee that any rank-influential case groups will
indeed be found. This last shortcoming is demonstrated in


Example 3. Suppose that the column-orthonormal matrix corresponding
to a model matrix $X$ is

$$U = \begin{bmatrix}
.500 & .289 \\
.500 & .289 \\
0 & .577 \\
0 & .577 \\
.707 & -.409
\end{bmatrix}$$



The groups $I = \{1, 2, 5\}$ and $I' = \{3, 4, 5\}$ are the minimal
rank-influential groups of size no larger than $n - p = 3$. (Of course, any
subset of size 4 is trivially rank influential.) The identity rotation
clearly identifies $I$ as rank influential. Another rotation could yield a
$U'$ with a column identifying $I'$ by simply rotating so that a coordinate
axis goes through the point (.500, .289). When the matrix $U'$ is
constrained to be orthogonal, these two subsets cannot be identified
simultaneously.^
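
The rank influence of these two groups can be confirmed from the basis matrix alone, since the hat matrix equals the basis matrix times its transpose. The sketch below assumes NumPy and computes the largest eigenvalue for every group of size three. The matrix entries are the three-decimal values of this example, with the last entry of the first column read as .707 (an assumption, consistent with column orthonormality), so eigenvalues equal one only up to rounding. The helper name `lambda_I` is hypothetical and case numbers are 1-based.

```python
import numpy as np
from itertools import combinations

# Basis matrix of Example 3, as printed to three decimals.
U = np.array([[0.500, 0.289],
              [0.500, 0.289],
              [0.000, 0.577],
              [0.000, 0.577],
              [0.707, -0.409]])

def lambda_I(U, cases):
    """Largest eigenvalue of H_I = U_I U_I^T for 1-based case numbers."""
    U_I = U[[c - 1 for c in cases]]
    return float(np.linalg.eigvalsh(U_I @ U_I.T)[-1])

lam = {g: lambda_I(U, g) for g in combinations(range(1, 6), 3)}
for group, value in sorted(lam.items(), key=lambda t: -t[1]):
    print(group, round(value, 4))
```

Only the two groups named in the text should come out at (approximately) one; the remaining triples leave two linearly independent rows behind and so fall clearly below one.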

Allowing oblique rotations might seem an appropriate extension of
the orthogonal Quartimax/Varimax Search, but even then, there may
be more than $p$ rank-influential groups. While these features clearly
limit the potential applicability of the Quartimax/Varimax Search, in
this case the varimax rotation of the matrix $U$ yields:

$$U' = \begin{bmatrix}
0.5580 & 0.1490 \\
0.5580 & 0.1490 \\
0.1501 & 0.5571 \\
0.1501 & 0.5571 \\
0.5763 & -0.5788
\end{bmatrix}$$



Neither rank-influential subset is identified! Thus, the
Quartimax/Varimax Search may fail.

To illustrate the optimization underlying the Quartimax/Varimax
search, consider Figure 1. The five cases are plotted in the plane given
by the first two columns of the matrix $U$. The single case is plotted
with an asterisk and the case pairs which coincide in the design/model
space are represented with x's. The closed curve illustrates the value
of the Quartimax criterion $\psi^*$ for any rotation as the curve's distance
from the origin when it intersects any set of orthogonal axes. The axes
with intersection furthest from the origin are the quartimax axes.

^I am grateful to a referee for bringing this example to my attention.

While the Quartimax/Varimax method may succeed in identifying
rank-influential groups for particular examples, it suffers from several
potential problems. First, at most $p$ rank-influential groups can be
identified. Second, the basis vectors in $U'$ are constrained to be
orthogonal. As in Example 3, distinct rank-influential groups may lie in
oblique subspaces. Third, the Quartimax criterion for specifying the
optimal rotation of a basis matrix $U$ may not be appropriate. In the
next section, investigation of a component-wise version of the
Quartimax criterion will demonstrate that it has a tendency to identify the
most outlying case groups in the configuration given by the rows of
$U$. This is evident in Figure 1 for Example 3: the most outlying case
(corresponding to row 5 of $U$) essentially defines the first Quartimax
axis.

4.2      Search  by  Projection  Pursuit 

The identification of rank-influential subsets is equivalent to a
projection pursuit problem with the $n$ points in $p$-space whose coordinates
are given by the rows of the column-orthonormal matrix $U$. Consider
the problem of finding interesting projections of $U$: $u = Ua$, for
some $a \in S = \{a \in \mathbb{R}^p : |a| = 1\}$. Recent applications of
projection pursuit have defined interesting as those projections achieving
a high value of $J(u)$, where the function $J$ is some index based on
the empirical distribution of the components of $u$, for example, the
value of the chi-squared statistic for testing normality, the value of
the Kolmogorov statistic for testing normality, the modified version
of the Friedman-Tukey index, absolute skewness, absolute kurtosis,
or Friedman's Legendre polynomial indices; see Huber (1988), also
Huber (1985, 1989).

The absolute kurtosis index, $J(u) = \sum_{i=1}^{n} |u_i|^4$, suggests itself
naturally in our problem of identifying rank-influential case groups because
it corresponds to a one-dimensional or component-wise version of the
Quartimax criterion. With an index such as this, we now propose

Method 2: Projection Pursuit Search of C.

Step 1: Specify the Projection Pursuit Index $J$, for example,
$J(u) = \sum_{i=1}^{n} |u_i|^4$.

Step 2: Construct $U$, a column-orthonormal matrix whose columns
are a basis for $C$, the column space of $X$.

Step 3: Find the unit $p$-vector $a$ which maximizes $J(Ua)$.

Step 4: Examine the squared elements of $u = Ua$ and identify the
group $I$ for which $u_i^2 \approx 0$, $i \notin I$, and $u_i^2 > 0$, $i \in I$.

Step 5: For group $I$ identified in Step 4, calculate

$$l_I = \sum_{i \in I} u_i^2,$$

the lower bound for $\lambda_I$; if $l_I = 1$, then classify case group $I$ as
rank influential.

Step 6 (Optional): Repeat Steps 3-5 in the subspace orthogonal to
the maximizing $a$ (and orthogonal to any subsequent $a$'s).
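
For $p = 2$ the search in Step 3 reduces to a one-dimensional scan over angles, which makes the method easy to sketch. The code below assumes NumPy and replaces the optimizers discussed next with a plain grid search, purely for illustration. The function names, the grid size, and the tolerance `zero_tol` are hypothetical; the basis matrix is the three-decimal matrix of Example 3 with the last entry of the first column read as .707 (an assumption).

```python
import numpy as np

def kurtosis_index(u):
    """Step 1: absolute-kurtosis index J(u) = sum |u_i|^4."""
    return float(np.sum(np.abs(u) ** 4))

def pursuit_2d(U, index, n_grid=3600):
    """Step 3 for p = 2: grid search over unit vectors a = (cos t, sin t)."""
    # a and -a give the same |u|, so half a turn suffices.
    thetas = np.linspace(0.0, np.pi, n_grid, endpoint=False)
    dirs = np.column_stack([np.cos(thetas), np.sin(thetas)])
    values = np.array([index(U @ a) for a in dirs])
    a_best = dirs[np.argmax(values)]
    return a_best, U @ a_best, float(values.max())

def group_and_bound(u, zero_tol=1e-3):
    """Steps 4-5: cases with non-negligible u_i^2 (1-based) and the bound l_I."""
    group = tuple(int(i) + 1 for i in np.flatnonzero(u ** 2 > zero_tol))
    l_I = float(np.sum(u[[c - 1 for c in group]] ** 2))
    return group, l_I

# Step 2: basis matrix of Example 3, as printed to three decimals.
U = np.array([[0.500, 0.289], [0.500, 0.289], [0.000, 0.577],
              [0.000, 0.577], [0.707, -0.409]])
a_best, u_best, j_best = pursuit_2d(U, kurtosis_index)
group, l_I = group_and_bound(u_best)
print(a_best, j_best, group, l_I)
```

A grid search does not scale beyond very small $p$; it is used here only because it makes the dependence of the result on the chosen index easy to inspect.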

Huber (1989) details two approaches for identifying maximal
directions $a$ in Step 3, depending upon the nature of the index $J$.
For smooth indexes he proposes a variant of the conjugate gradient
method and for rough indexes, a random search with similarities to
simulated annealing.


Example 3 (Continued). Figure 2 illustrates the results of a
Projection Pursuit Search of $C$ specified by the columns of $U$ in Example
3. We define interesting projections of $U$ as those with high values
of the component-wise Quartimax (equivalently, absolute kurtosis)
criterion. As in Figure 1, the five cases are plotted in the plane given
by the first two columns of the matrix $U$. The closed curve now
illustrates the value of the component-wise Quartimax criterion $J(u)$ as
the curve's distance from the origin when it intersects any unit vector
$a$ emanating from the origin. The Quartimax Axis 1 maximizes the
index and Quartimax Axis 2 is a local maximum of the index which
happens to be orthogonal to the first. (These axes are identified by
the Projection Pursuit Search using Huber's (1988) conjugate gradient
search with his equivalent KURT index.)

Our analysis of Example 3 raises the issue of the appropriateness
of the component Quartimax criterion for the Projection Pursuit
Search (and of the corresponding matrix criterion for the
Quartimax/Varimax Search) to identify rank-influential case groups. As
noted in the projection pursuit literature, this criterion (which is
equivalent to the absolute kurtosis) finds directions exhibiting
outliers; see, e.g., Huber (1988, p. 6).

Our search for rank-influential case groups has the complementary
purpose: identifying projections in which many cases are inliers,
indeed, extreme inliers coincident at the origin. That these two criteria
are related in our problem is seen by noting that the orthonormality
of the projections of $U$ imposes the relationship

$$\sum_{i \in I} u_i^2 = 1 - \sum_{i \notin I} u_i^2.$$

When the case group $I$ is outlying to a maximal extent, the left-hand
side of this expression is 1, and the cases $i \notin I$ coincide at the origin.

Apparently,  the  Quartimax  Criterion  may  overlook  those  rank- 
influential  case  groups  which  consist  of  cases  which  are  not  outlying  in 
the  same  direction  (e.g.,  group  /  =  {1,2,5}  in  Example  3).  However, 
when  they  do  lie  in  a  similar  direction,  as  in  Example  1,  then  the 
criterion  can  succeed. 

For general application of the Projection Pursuit Search to
succeed, alternative projection indexes are needed. Consider the class of
indices

$$J_\alpha(u) = \sum_{i=1}^{n} \left[ 1 - |u_i|^\alpha \right], \qquad 0 < \alpha < 1.$$

With values of $\alpha$ close to zero, the index identifies directions with
multiple inliers coincident at the origin. In particular,

$$\lim_{\alpha \to 0} J_\alpha(u) = \#\{j : u_j = 0\}.$$

This feature of the indices suggests that we focus on $J_\alpha$ with $\alpha \approx 0$.
Define

$$J^*(u) = \lim_{\alpha \to 0} \frac{1}{\alpha} J_\alpha(u) = \sum_{i=1}^{n} \left( -\ln |u_i| \right).$$
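
The two limits that define this family are simple to check numerically. The sketch below assumes NumPy; the function names are hypothetical, and the `floor` argument is an added safeguard (not part of the definition) that keeps the logarithmic index finite when a component is exactly zero.

```python
import numpy as np

def J_alpha(u, alpha):
    """Index J_alpha(u) = sum_i [1 - |u_i|^alpha], for 0 < alpha < 1."""
    u = np.abs(np.asarray(u, dtype=float))
    return float(np.sum(1.0 - u ** alpha))

def J_star(u, floor=1e-12):
    """Limiting index J*(u) = sum_i (-ln |u_i|).

    `floor` is an illustrative safeguard: components smaller than it are
    clipped so that exact zeros contribute a large finite value.
    """
    u = np.abs(np.asarray(u, dtype=float))
    return float(np.sum(-np.log(np.maximum(u, floor))))

# A projection with two components exactly at the origin.
u_zeros = np.array([0.5, 0.5, 0.0, 0.0, 0.707])
count_like = J_alpha(u_zeros, 1e-6)            # close to the number of zeros

# A projection with no zero components, to compare the scaled limit.
u_plain = np.array([0.6, 0.8])
scaled = J_alpha(u_plain, 1e-6) / 1e-6
print(count_like, scaled, J_star(u_plain))
```

With a very small exponent the first value is close to the count of zero components, and the scaled index for a projection without zeros is close to the logarithmic index.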

With this new index specifying a Projection Pursuit Search, we now
return to Example 3.


Example 3 (Continued). Figure 3 presents the version of Figures 1
and 2 with the index $J^*$ rather than the matrix/component
Quartimax criterion. The axes defined by the extreme values of the $J^*$ index
identify all the basic rank-influential groups. While these axes are
readily apparent from the figure, they are also identified by the
Projection Pursuit Search. We note that modification of Huber's (1988)
conjugate-gradient search algorithm was necessary because maxima
of the index $J^*$ have singular rather than null derivatives and, local
to the maxima, the Hessian is positive definite rather than negative
definite.

Group {1,2,5} is identified by the horizontal axis, group {3,4,5}
is found by projecting onto the axis with negative slope, and the group
{1,2,3,4} is determined by the axis with positive slope. Other
("non-basic") rank-influential groups are those which contain one of
these three basic groups.


To  investigate  the  effectiveness  of  the  Projection  Pursuit  Search  of 
C  using  the  index  J*,  consider 


Example 4. Five column-orthonormal matrices of dimension n = 20 by
p = 2 were constructed containing single rank-influential case groups
ranging in size from k = 1 to k = 16. Each panel of Figure 5 illustrates
the configuration of the 20 cases in two dimensions together with a
curve giving the value of J* in terms of the curve's distance from the
origin in projection directions a. While the number of maxima of J*
increases with the size of the rank-influential group k, the index
distinguishes rank-influential groups even when their size exceeds 50% of
the data. Consequently, we expect this index to prove useful in a wide
variety of applications where rank-influential groups of even modest
size are sought.
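
A setting of this kind can be imitated with a small sketch. Everything below is an assumption made for illustration: the 20 by 2 design is hypothetical (it is not one of the five matrices of the example), the function name `j_star_scan` is invented, and a plain grid search over angles stands in for the modified conjugate-gradient search described earlier.

```python
import numpy as np

def j_star_scan(U, n_angles=3600):
    """Evaluate J*(a) = sum_i -ln|u_i . a| on a grid of directions in the plane."""
    theta = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    A = np.column_stack([np.cos(theta), np.sin(theta)])  # one unit direction per row
    proj = np.abs(U @ A.T)                               # n x n_angles projections
    with np.errstate(divide="ignore"):
        j = -np.log(proj).sum(axis=0)                    # +inf if a projection is exactly 0
    return theta, A, j

# Hypothetical design: cases 0-3 form the group I; the other 16 cases lie on
# one line through the origin, so deleting I leaves a rank-1 matrix.
group_rows = np.array([[1.0, 0.0], [2.0, -1.0], [1.0, 3.0], [-1.0, -2.0]])
line_rows = np.outer(np.arange(1.0, 17.0), [1.0, 1.0])
X = np.vstack([group_rows, line_rows])
U, _ = np.linalg.qr(X)                                   # column-orthonormal basis

theta, A, j = j_star_scan(U)
best = int(np.argmax(j))                                 # direction maximising J*
u_best = U @ A[best]
found_group = set(np.flatnonzero(np.abs(u_best) > 1e-2).tolist())
print(found_group)
```

In the maximising direction the 16 collinear cases project essentially onto the origin, so the cases with non-negligible projections are the members of the group, even though here the inliers make up 80% of the data. The 1e-2 cut-off is an arbitrary choice tied to the grid resolution.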
5      Summary and Concluding Remarks

While  the  identification  of  rank-influential  groups  of  cases  is  trivial 
in  the  case  of  single-case  groups  and  straightforward  with  case  pairs, 
the  combinatorial  and  numerical  complexity  of  their  direct  identifica- 
tion for  larger  sized  groups  is  considerable.  Two  alternative  methods 
to  combinatorics-based  searches  are  proposed  which  use  searches  in 
p-dimensional  representations  of  the  n  cases  of  a  regression  model. 
These searches focus on directions in which most cases are
extreme inliers, coincident at the origin.
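
To make the combinatorial burden concrete, the direct approach can be sketched as a brute-force rank check over every case group of a given size. The function name `rank_influential_groups` and the small design are hypothetical, and the numerical rank comes from NumPy's SVD-based `matrix_rank` with its default tolerance.

```python
from itertools import combinations
from math import comb
import numpy as np

def rank_influential_groups(X, k):
    """List all size-k case groups whose deletion lowers the rank of X."""
    n = X.shape[0]
    full_rank = np.linalg.matrix_rank(X)
    hits = []
    for group in combinations(range(n), k):
        keep = np.setdiff1d(np.arange(n), group)     # cases left after deletion
        if np.linalg.matrix_rank(X[keep]) < full_rank:
            hits.append(group)
    return hits

# Hypothetical 6 x 2 design: cases 2-5 lie on one line through the origin.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [2.0, 2.0], [3.0, 3.0], [-1.0, -1.0]])
print(rank_influential_groups(X, 1))
print(rank_influential_groups(X, 2))
print(comb(20, 10))    # number of size-10 groups to check when n = 20
```

Each candidate group costs one rank computation, and the number of candidates grows binomially with n and k, which is the motivation for the two search methods summarized here.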

The  first  method  constructs  a  column-orthonormal  basis  matrix 
for  the  model  matrix  using  the  Quartimax/Varimax  rotation  of  any 
such  basis  matrix.  While  not  always  successful,  the  method's  compu- 
tational requirements  are  modest  and  it  can  identify  rank-influential 
groups  whose  cases  are  outlying  in  the  design  space  in  similar  (that 
is,  other  than  orthogonal)  directions. 
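
A rotation of this type can be sketched with the standard SVD-based orthomax iteration, specialised to the Quartimax criterion (the sum of fourth powers of the rotated elements). This is a generic textbook iteration offered as an assumption, not necessarily the algorithm used for the results reported in this paper; the function name `quartimax_rotate` and the random design are hypothetical.

```python
import numpy as np

def quartimax_rotate(Q, n_iter=500, tol=1e-12):
    """Rotate column-orthonormal Q to increase sum((Q @ R)**4) over orthogonal R."""
    p = Q.shape[1]
    R = np.eye(p)
    crit_old = np.sum(Q ** 4)
    for _ in range(n_iter):
        L = Q @ R
        # Q.T @ L**3 is (up to a constant) the gradient of the criterion;
        # its polar factor gives the next orthogonal rotation.
        W, _, Vt = np.linalg.svd(Q.T @ L ** 3)
        R = W @ Vt
        crit = np.sum((Q @ R) ** 4)
        if crit - crit_old < tol:                    # no further improvement
            break
        crit_old = crit
    return Q @ R

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))                         # hypothetical model matrix
Q, _ = np.linalg.qr(X)                               # any column-orthonormal basis
Q_rot = quartimax_rotate(Q)
print(np.round(Q_rot ** 2, 3))                       # squared elements, as in Table 2
```

The rotated matrix remains a column-orthonormal basis for the same column space. Columns whose squared elements concentrate on a few rows are the ones to inspect for candidate rank-influential groups.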

The second method applies projection pursuit using a new projection
index (J*) which highlights all rank-influential groups; that
is, its maximal values occur in projections identifying rank-influential
groups. The computations for the projection pursuit search are more
involved than those for the Quartimax search, but they are less
intensive than the alternative combinatorial search for the extreme
roots of p-th order polynomials, where the polynomials are indexed by
potentially rank-influential groups.

We  conclude  with  two  remarks. 

Remark A. The median absolute deviation (MAD) has been suggested
(A.C. Atkinson, personal communication) as a possible index for the
projection pursuit search of degenerate projections. We did not explore
use of this criterion for two reasons. First, the gradient of the index
at any projection direction depends only on one design point, which
typically corresponds to just one case or a small number of replicates.
In contrast, the gradient of the index J* depends on all cases. Second,
the MAD criterion would lead to the identification of only those
rank-influential groups of size smaller than n/2. Consequently, it would
not succeed in Example 3, where the rank-influential groups constituted
60% of the cases.

Remark B. Rousseeuw's minimum-volume ellipsoid (MVE) has been
suggested (R.D. Cook, personal communication) as potentially useful in
identifying rank-influential observations; see Rousseeuw and Leroy
(1987, p. 158). When there is a rank-influential group for a given
model matrix, no finite-volume ellipsoid based on groups of cases
outside the group will contain all the data. The MVE relies upon
full-rank configurations of elemental sets of cases. Of course, an
ad hoc method based on the MVE might be explored which perturbs a model
matrix by the addition of very small errors to every element. However,
if the rank deficiency corresponds to a direction in which there is
little variability, then such "fuzzing up" of the data might hide the
rank deficiency.
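
The size limitation of the MAD criterion noted in Remark A can be illustrated with a minimal sketch. The helper `mad` and both projected vectors are hypothetical: the MAD of a projection vanishes only when more than half of the cases coincide, so a degenerate projection in which the inliers are a minority goes undetected.

```python
from statistics import median

def mad(values):
    """Median absolute deviation about the median."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)

# Hypothetical unit-length projections of n = 5 cases.
u_small_group = [0.0, 0.0, 0.0, 0.6, 0.8]      # group of 2 cases, 3 inliers at 0
u_large_group = [0.0, 0.0, 0.6, 0.48, 0.64]    # group of 3 cases (60%), 2 inliers at 0

print(mad(u_small_group))
print(mad(u_large_group))
```

Only the first projection yields a MAD of zero, although both are degenerate in the sense that the cases outside the group coincide at the origin.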
Table 1. Q Matrix for the FACE DATA
(× 100000)

                                                    COLUMN
ROW        1         2         3         4         5         6         7         8         9        10        11
  1     -732.    17391.    50709.   -15004.    39300.      673.   -14744.   -13944.     3442.   -25163.    -9990.
  2    -3530.    92327.   -14636.    27516.   -13263.    -2137.    -4056.    -3950.       14.     28ol.    -1501.
  3    -1935.    10094.    -5123.   -54935.   -12386.   -18991.   -32175.    25410.   -25718.    10354.    10533.
  4    -2220.     6231.    20903.   -14082.   -17323.    -9874.     3330.     5842.    33673.    -1536.    27S7S.
  5    -3504.    17098.    -6853.   -44396.   -22708.    11645.    22559.   -17784.    -8564.   -11656.     3390.
  6    -4614.     7123.    25S72.   -10676.    19924.   -51092.    19038.   -43796.   -27949.    27970.   -14559.
  7    -8635.     2089.    15032.      153.     9537.   -13809.     -524.   -13870.    40333.    19498.    52944.
  8   -11956.     1162.    16372.    -6983.   -28643.   -14603.    16863.     4569.    -1053.    15573.     2709.
  9   -13236.     9737.    41245.    -1437.    16919.    32389.    -2239.    15221.     8347.     6075.    14162.
 10   -13512.     7086.    -8856.   -10233.    16415.   -20650.    31506.    30818.    18053.    -4902.   -17572.
 11   -13986.     4742.   -10290.   -37145.   -12749.    19930.    13319.   -25634.    -7723.    -5930.     7077.
 12   -15213.      604.    -9437.   -21920.     3456.    220-1.     6858.      314.    20476.     8664.   -21254.
 13   -15773.     185o.    22035.    -4329.   -30967.     8323.    -3967.    21566.    12769.   -20395.    -2035.
 14   -16611.      372.   .^9951.   -17845.    10007.    -4983.    19872.    103S2.    14207.    -1983.   -13231.
 15   -13273.     9166.   -11149.     1579.    20540.   -15302.    13190.    24332.    27988.     7954.   -25399.
 16   -187SS.     6237.    37057.     -677.    -4526.    26254.     9522.    20000.   -19375.    11055.   -13434.
 17   -24393.    -5923.     9204.     8293.   -15063.   -31720.   -32705.   -13173.    10117.   -60935.   -22593.
 18   -25923.    -3133.   -14231.   -12251.     80-1.    30761.   -40384.   -35531.    34494.    36474.   -32697.
 19   -26182.    -9133.     1234.     9772.   -JOOwC.   -18Z42.     1798.     1043.     5554.      101.     4c54.
 20   -2S035.     3707.   -15900.    -7047.   25C.22.   -17411.   -47107.    30431.   -22119.    10449.    20739.
 21   -30523.    -6018.   -15310.    -2428.    25163.     1384.    23920.     5090.    -1447.   -17913.    10440.
 22   -31740.   -11383.       14.    20673.   -13530.   -17237.     2003. -125:-:7.     2524.    20733.    18336.
 23   -33338.     2976.   -21290.     7594.    22149.    21592.    11371.   -21876.   -19493.   -25261.    33191.
 24   -3S9S0.    -8455.    15472.    22464.   -14306.    11637.     6173.     7023.   -33231.    17677.   -18549.

Table 2. Squared Elements of Matrix Q' for FACE DATA
(× 100000)

                                                    COLUMN
ROW        1         2         3         4         5         6         7         8         9        10        11
  1       10.        0.    51060.        6.       58.     3063.        0.        6.        1.     3709.      100.
  2       13.    97233.        0.        9.        0.        0.        0.        0.        8.        0.        0.
  3     3816.       65.       10.     8013.       53.      107.    49891.      133.      122.       57.       87.
  4     2670.        6.      272.     2663.        ?.     1113.        0.      587.      208.      367.    22637.
  5      409.      393.       15.    38S77.       97.      161.        0.       17.        5.       87.      141.
  6       99.        1.      438.       55.       75.    78411.        0.       58.        6.        2.       42.
  7      641.        3.      179.      497.     1594.      317.        0.      221.       47.      943.    51486.
  8     2959.        8.     2288.     2821.     6806.     1714.        0.      729.      223.        2.     2959.
  9      194.        1.    19632.      122.    814:3.     2719.        0.       88.       13.     3340.     4225.
 10       91.        8.        3.       24.       15.       31.        0.     2067.    34363.       17.       22.
 11     4730.      106.        3.    26333.       25.       33.        0.     1357.      379.       18.       34.
 12      170.      182.        2.     2273.       12.       17.        0.     6342.     903o.        9.       16.
 13     2440.        5.      414.     2496.     7634.    806r:.        0.      509.      193.     55*7.     1137.
 14      241.      2«4.        1.     1573.        9.       14.        0.      163.    14700.        8.       12.
 15      304.      5S9.        0.   zzs:-..        0.        0.        0.      921.    30990.        0.        0.
 16      317.        1.     5355.      344.    26649.      113.        0.       61.       26.     2109.      967.
 17       5o.        0.      465.       31.       80.        0.        0.       32.        4.    76799.      203.
 18       61.        1.        3.       10.       11.       33.        0.    84-00.       15.       20.       19.
 19      226.        0.     96?4.      261.     6067.        0.        0.       36.       IS.     6251.     3573.
 20     3720.       64.       10.     7896.       58.      107.    50108.      132.      121.       57.       87.
 21    18554.      792.        3.       86.       16.       33.        0.     1026.     8335.       18.       25.
 22     1^32.        6.     8399.     1657.     4934.     2635.        0.      5ol.      145.      436.     89o5.
 23    55230.      IS*.       16.      65o.       87.      183.        0.      126.      404.       9S.      13c.
 24     1117.        4.    II?-?.      950.    37513.     1030.        0.      324.       83.        0.     3120.

Figure  1.  Five-Point  Representation  of  U  for  Example  3. 
Figure 2. Quartimax Index Plot for Example 3.
Figure 3. Component Quartimax Index Plot for Example 3.
Figure 4. J* Index Plot for Example 3.